Data processing on a large scale
Authors
Abstract
Over the last decade, the high cost and complexity of organizing Next Generation Sequencing (NGS) data have put pressure on NGS data centres to establish convenient IT service infrastructures for automatic data management, processing and analysis. Our market analysis showed that existing applications for processing NGS data were insufficiently documented, not extensible, or strongly dependent on the underlying technical system. This motivated us to develop an automated job control system, the One Touch Pipeline (OTP), to ensure high-quality data processing while reducing the manpower and time it requires. OTP's functionality encompasses all relevant steps of the pipeline, from sequence acquisition to data analysis on high-performance computing resources. OTP provides a flexible solution for the automatic processing of NGS sequence data generated by sequencing centres. User-friendly web pages and the platform independence of the OTP application make it a sustainable long-term solution. The crucial strengths of OTP are:
• Automatic processing of NGS data, including alignment.
• Management and coordination of sequencing runs and associated metadata.
• Generation of quality control results (e.g. FastQC reports, coverage rates) and scores.
• Web-based support for principal investigators, sequencing centres, etc.
• Monitoring of job activities and quality control.
• Interfaces for automated export of raw data and results to the ICGC, EGA and ENA repositories.
• Automated distribution of jobs across a cluster system with 1600 nodes.
• Storage capacity of approximately 10 petabytes.
Projects using the OTP application include the German ICGC projects PedBrain, MMML and Early Onset Prostate Cancer, the "Deutsches Epigenom Programm" (DEEP) as part of the International Human Epigenome Consortium (IHEC), the National Genome Research Network (NGFN), and the Heidelberg Initiative for Personalized Medicine (HIPO). IWBBIO 2013 Proceedings, Granada, 18-20 March 2013.
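To make the kind of automation described above concrete, the sketch below shows how a job control system of this sort might package a single alignment and quality-control step for one FASTQ file and hand it to a batch scheduler. It is a minimal sketch, not OTP's actual implementation: the PBS-style scheduler, the bwa/samtools/FastQC command lines, and all paths and function names are illustrative assumptions.

# Hypothetical sketch of an automated alignment + QC submission step.
# All names, paths, and the PBS-style scheduler are illustrative assumptions,
# not OTP's real code.
import subprocess
from pathlib import Path

def build_job_script(fastq: Path, reference: Path, out_dir: Path) -> str:
    """Compose a batch script that runs FastQC on one FASTQ file and
    aligns it with BWA, writing all results to out_dir."""
    sample = fastq.name.split(".")[0]  # e.g. "sample1" from "sample1.fastq.gz"
    return f"""#!/bin/bash
#PBS -N otp_{sample}
#PBS -l nodes=1:ppn=8,walltime=24:00:00
set -euo pipefail
mkdir -p {out_dir}
fastqc --outdir {out_dir} {fastq}
bwa mem -t 8 {reference} {fastq} | samtools sort -o {out_dir}/{sample}.bam -
samtools index {out_dir}/{sample}.bam
samtools flagstat {out_dir}/{sample}.bam > {out_dir}/{sample}.flagstat
"""

def submit(script: str) -> str:
    """Submit the script to the cluster scheduler (qsub reads it from stdin)
    and return the scheduler's job id."""
    result = subprocess.run(["qsub"], input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    job_id = submit(build_job_script(Path("/data/run42/sample1.fastq.gz"),
                                     Path("/ref/hs37d5.fa"),
                                     Path("/results/run42/sample1")))
    print("submitted", job_id)

In a setup like this, the web application would record the returned job id and poll the scheduler for its status, which is one way the monitoring of job activities and quality control results could be tied back to each sequencing run.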
Similar resources
Curbing variations in packaging process through Six Sigma way in a large-scale food-processing industry
Indian industries need overall operational excellence for sustainable profitability and growth in the present age of global competitiveness. Among different quality and productivity improvement techniques, Six Sigma has emerged as one of the most effective breakthrough improvement strategies. Though Indian industries are exploring this improvement methodology to their advantage and reaping the ...
Developing a New Method in Object Based Classification to Updating Large Scale Maps with Emphasis on Building Feature
With the expansion of cities, updating urban maps for urban planning is important, and its effectiveness depends on the accuracy of information extraction / change detection. Information extraction methods are divided into two groups: Pixel-Based (PB) and Object-Based (OB). OB analysis has overcome the limitations of PB analysis (producing salt-pepper results and features with hole...
Statistical Wavelet-based Image Denoising using Scale Mixture of Normal Distributions with Adaptive Parameter Estimation
Removing noise from images is a challenging problem in digital image processing. This paper presents an image denoising method based on a maximum a posteriori (MAP) density function estimator, which is implemented in the wavelet domain because of its energy compaction property. The performance of the MAP estimator depends on the proposed model for noise-free wavelet coefficients. Thus in the wa...
Evaluation of Close-Range Photogrammetric Technique for Deformation Monitoring of Large-Scale Structures: A review
Close-range photogrammetry has been used in many applications in recent decades in various fields such as industry, cultural heritage, medicine and civil engineering. As an important tool for displacement measurement and deformation monitoring, close-range photogrammetry has generally been employed in industrial plants, quality control and accidents. Although close-range photogrammetric applica...
An Efficient Data Replication Strategy in Large-Scale Data Grid Environments Based on Availability and Popularity
Data grid technology, which uses the scale of the Internet to address storage limitations for huge amounts of data, has become a hot research topic. Recently, data replication strategies have been widely employed in distributed environments to copy frequently accessed data to suitable sites. The primary purposes are shortening the distance of file transmission and achieving files from ...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system's rack-aware data placement strategy in MapReduce assume that each node in a homogeneous Hadoop cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
Journal title:
Volume/Issue:
Pages: -
Publication date: 2013